Hand-on_Ex08

Author

XueRong

Introduction

This analysis aims to demonstrate predictive modeling using Geographically Weighted Regression (GWR) and Geographic Random Forest (GRF) techniques in R. The document will guide you through building, calibrating, and interpreting predictive models with geospatial data.

Libraries Setup

Make sure to load the following libraries:

pacman::p_load(sf, spdep, GWmodel, SpatialML, tmap, rsample, Metrics, tidyverse, spgwr)
Warning: package 'SpatialML' is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Warning: 'BiocManager' not available.  Could not check Bioconductor.

Please use `install.packages('BiocManager')` and then retry.
Warning in p_install(package, character.only = TRUE, ...):
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'SpatialML'
Warning in pacman::p_load(sf, spdep, GWmodel, SpatialML, tmap, rsample, : Failed to install/load:
SpatialML

Data Preparation

Load the required datasets and prepare for analysis:

# Load model data
mdata <- read_rds("/Users/sharon/OneDrive - Singapore Management University/ISSS626 Data/Hands-on_Ex/Hands-on_Ex08/data/mdata.rds")

# Inspect the dataset
glimpse(mdata)
Rows: 15,901
Columns: 18
$ resale_price             <dbl> 330000, 360000, 370000, 375000, 380000, 38000…
$ floor_area_sqm           <dbl> 92, 91, 92, 99, 92, 92, 92, 92, 93, 91, 91, 9…
$ storey_order             <int> 1, 3, 1, 2, 2, 4, 3, 2, 4, 3, 3, 3, 4, 3, 2, …
$ remaining_lease_mths     <dbl> 684, 738, 733, 700, 715, 732, 706, 745, 731, …
$ PROX_CBD                 <dbl> 8.824749, 9.841309, 9.560780, 9.609731, 8.351…
$ PROX_ELDERLYCARE         <dbl> 0.2514065, 0.6318448, 1.0824168, 0.3473195, 0…
$ PROX_HAWKER              <dbl> 0.44182653, 0.26972560, 0.25829513, 0.4364751…
$ PROX_MRT                 <dbl> 0.6885144, 1.0969096, 0.8862859, 1.4093169, 0…
$ PROX_PARK                <dbl> 0.7450859, 0.4294870, 0.7800777, 0.1776163, 0…
$ PROX_GOOD_PRISCH         <dbl> 1.2703931, 0.4045792, 2.0942375, 0.1375070, 1…
$ PROX_MALL                <dbl> 0.5534331, 1.0677012, 0.9751113, 1.1752392, 1…
$ PROX_CHAS                <dbl> 1.364596e-01, 2.569863e-01, 1.906189e-01, 2.9…
$ PROX_SUPERMARKET         <dbl> 0.2708222, 0.3101889, 0.3187560, 0.4586748, 0…
$ WITHIN_350M_KINDERGARTEN <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, …
$ WITHIN_350M_CHILDCARE    <int> 6, 5, 2, 3, 3, 2, 3, 4, 3, 2, 4, 4, 4, 5, 2, …
$ WITHIN_350M_BUS          <int> 8, 8, 8, 7, 6, 9, 6, 6, 5, 4, 10, 5, 6, 9, 8,…
$ WITHIN_1KM_PRISCH        <int> 2, 2, 1, 2, 2, 1, 3, 2, 2, 2, 2, 2, 3, 2, 2, …
$ geometry                 <POINT [m]> POINT (29179.92 38822.08), POINT (28423…
mdata$resale_price <- as.numeric(mdata$resale_price)

# Generate coordinates and add them to the data
coords <- st_coordinates(st_centroid(mdata))
mdata$X <- coords[,1]
mdata$Y <- coords[,2]

Data Sampling

Split the data into training and test datasets:

# Data split for training and testing
set.seed(1234)
resale_split <- initial_split(mdata, prop = 0.65)
train_data <- training(resale_split)
test_data <- testing(resale_split)

write_rds(train_data, "/Users/sharon/OneDrive - Singapore Management University/ISSS626 Data/Hands-on_Ex/Hands-on_Ex08/data/train_data.rds")
write_rds(test_data, "/Users/sharon/OneDrive - Singapore Management University/ISSS626 Data/Hands-on_Ex/Hands-on_Ex08/data/test_data.rds")

mdata_sp <- as_Spatial(mdata)

Exploratory Analysis

Before performing the GWR, examine the spatial relationships and dependencies in the data:

# Check for spatial autocorrelation based on available geometries
if (any(st_is(mdata, c("POLYGON", "MULTIPOLYGON")))) {
  message('Polygon geometries found. Using polygon-based neighborhood analysis.')
  listw <- nb2listw(poly2nb(st_as_sf(mdata), queen = TRUE))
} else {
  message('Polygon geometries not found. Using point-based neighborhood analysis.')
  coords <- st_coordinates(st_centroid(mdata))
  listw <- nb2listw(knn2nb(knearneigh(coords, k = 5)))
}
Polygon geometries not found. Using point-based neighborhood analysis.
Warning in knearneigh(coords, k = 5): knearneigh: identical points found
Warning in knearneigh(coords, k = 5): knearneigh: kd_tree not available for
identical points
# Moran's I Test for Global Spatial Autocorrelation
moran.test(as.numeric(mdata[['resale_price']]), listw)

    Moran I test under randomisation

data:  as.numeric(mdata[["resale_price"]])  
weights: listw    

Moran I statistic standard deviate = 202.57, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     9.343997e-01     -6.289308e-05      2.128043e-05 

Geographically Weighted Regression (GWR)

GWR helps to model spatial variability by estimating local relationships. We will fit a GWR model using the gwr function from the spgwr package.

# Set up bandwidth for GWR using Spatial*DataFrame
bw <- GWmodel::bw.gwr(formula = resale_price ~ floor_area_sqm + remaining_lease_mths + PROX_CBD, 
                      data = mdata_sp, approach = 'AIC', kernel = 'bisquare', adaptive = TRUE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
Adaptive bandwidth (number of nearest neighbours): 9834 AICc value: 395915.2 
Adaptive bandwidth (number of nearest neighbours): 6086 AICc value: 393850.5 
Adaptive bandwidth (number of nearest neighbours): 3767 AICc value: 391861.6 
Adaptive bandwidth (number of nearest neighbours): 2337 AICc value: 390007.2 
Adaptive bandwidth (number of nearest neighbours): 1449 AICc value: 387826.1 
Adaptive bandwidth (number of nearest neighbours): 905 AICc value: 385175 
Adaptive bandwidth (number of nearest neighbours): 563 AICc value: 382097.2 
Adaptive bandwidth (number of nearest neighbours): 358 AICc value: 379593.2 
Adaptive bandwidth (number of nearest neighbours): 224 AICc value: 391093.5 
Adaptive bandwidth (number of nearest neighbours): 433 AICc value: 380564.1 
Adaptive bandwidth (number of nearest neighbours): 303 AICc value: 378725.8 
Adaptive bandwidth (number of nearest neighbours): 278 AICc value: 378186.9 
Adaptive bandwidth (number of nearest neighbours): 253 AICc value: 377698.2 
Adaptive bandwidth (number of nearest neighbours): 247 AICc value: 377587.1 
Adaptive bandwidth (number of nearest neighbours): 234 AICc value: 377268.4 
Adaptive bandwidth (number of nearest neighbours): 235 AICc value: 377294.9 
Adaptive bandwidth (number of nearest neighbours): 242 AICc value: 377468.8 
Adaptive bandwidth (number of nearest neighbours): 237 AICc value: 377343.9 
Adaptive bandwidth (number of nearest neighbours): 240 AICc value: 377406.4 
Adaptive bandwidth (number of nearest neighbours): 238 AICc value: 377367.7 
Adaptive bandwidth (number of nearest neighbours): 239 AICc value: 377397 
Adaptive bandwidth (number of nearest neighbours): 238 AICc value: 377367.7 
Adaptive bandwidth (number of nearest neighbours): 238 AICc value: 377367.7 
Adaptive bandwidth (number of nearest neighbours): 237 AICc value: 377343.9 
Adaptive bandwidth (number of nearest neighbours): 237 AICc value: 377343.9 
Adaptive bandwidth (number of nearest neighbours): 236 AICc value: 377323.9 
Adaptive bandwidth (number of nearest neighbours): 236 AICc value: 377323.9 
Adaptive bandwidth (number of nearest neighbours): 235 AICc value: 377294.9 
Adaptive bandwidth (number of nearest neighbours): 235 AICc value: 377294.9 
Adaptive bandwidth (number of nearest neighbours): 234 AICc value: 377268.4 
# Fit the GWR model using Spatial*DataFrame
gwr_model <- GWmodel::gwr.basic(formula = resale_price ~ floor_area_sqm + remaining_lease_mths + PROX_CBD, 
                                data = mdata_sp, bw = bw, kernel = 'gaussian', adaptive = TRUE)

# Summary of the GWR model
summary(gwr_model)
              Length Class                  Mode
GW.arguments     11  -none-                 list
GW.diagnostic     8  -none-                 list
lm               14  lm                     list
SDF           15901  SpatialPointsDataFrame S4  
timings           5  -none-                 list
this.call         6  -none-                 call
Ftests            0  -none-                 list

Visualizing GWR Results

Use thematic maps to illustrate the variation in local coefficients:

# Extract local coefficient values from the model
mdata_sp$gwr_coeff <- gwr_model$SDF$floor_area_sqm

# Convert Spatial*DataFrame back to sf for visualization
mdata <- st_as_sf(mdata_sp)

# Visualize the local coefficients using points instead of polygons
tmap_mode("view")
tmap mode set to interactive viewing
# Map of GWR Coefficients
tm_shape(mdata) +
  tm_dots("gwr_coeff", style = "quantile", title = "GWR Coefficient for Floor Area (sqm)")
Variable(s) "gwr_coeff" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Building a Non-Spatial Multiple Linear Regression

Construct a non-spatial multiple linear regression model for comparison:

# Fit a multiple linear regression model
price_mlr <- lm(resale_price ~ floor_area_sqm + remaining_lease_mths + PROX_CBD, data = train_data)
summary(price_mlr)

Call:
lm(formula = resale_price ~ floor_area_sqm + remaining_lease_mths + 
    PROX_CBD, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-198760  -48288   -2709   40297  476007 

Coefficients:
                       Estimate Std. Error  t value Pr(>|t|)    
(Intercept)           83596.756  11169.080    7.485 7.76e-14 ***
floor_area_sqm         2485.869    101.597   24.468  < 2e-16 ***
remaining_lease_mths    411.002      4.721   87.062  < 2e-16 ***
PROX_CBD             -21764.186    173.923 -125.137  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 70870 on 10331 degrees of freedom
Multiple R-squared:  0.6525,    Adjusted R-squared:  0.6524 
F-statistic:  6466 on 3 and 10331 DF,  p-value: < 2.2e-16
# Save the model
write_rds(price_mlr, "/Users/sharon/OneDrive - Singapore Management University/ISSS626 Data/Hands-on_Ex/Hands-on_Ex08/data/price_mlr.rds")

Geographical Random Forest (GRF)

Use GRF to predict HDB resale prices and evaluate spatial variability:

# Install the ranger package if it's not already installed
if (!requireNamespace("ranger", quietly = TRUE)) {
  install.packages("ranger")
}

# Load the library
library(ranger)
Warning: package 'ranger' was built under R version 4.2.3
# Ensure train_data is in sf format before extracting coordinates
train_data_sf <- training(resale_split)  # This keeps train_data as an sf object

# Extract coordinates before dropping geometry
coords_train <- st_coordinates(st_centroid(train_data_sf))

# Drop the geometry column to prepare the training data for modeling
train_data_no_geom <- train_data_sf %>% st_drop_geometry()

# Fit a Random Forest model using ranger
rf_model <- ranger(formula = resale_price ~ floor_area_sqm + remaining_lease_mths + PROX_CBD,
                   data = train_data_no_geom,  # Use the dataset without geometry
                   num.trees = 500,            # Set the number of trees
                   mtry = 3,                   # Number of variables to possibly split at in each node
                   importance = 'impurity')    # To get the importance of the variables

# Save the Random Forest model
write_rds(rf_model, "/Users/sharon/OneDrive - Singapore Management University/ISSS626 Data/Hands-on_Ex/Hands-on_Ex08/data/rf_model.rds")

# Print model summary
print(rf_model)
Ranger result

Call:
 ranger(formula = resale_price ~ floor_area_sqm + remaining_lease_mths +      PROX_CBD, data = train_data_no_geom, num.trees = 500, mtry = 3,      importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      10335 
Number of independent variables:  3 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       1382206218 
R squared (OOB):                  0.9043364 

Interpretation of Results

  • The GWR coefficients map shows where the relationship between resale_price and floor_area_sqm is strongest and weakest.
  • Areas with positive coefficients indicate a positive relationship between the predictor and the response variable.
  • The GWR model has an Adjusted R-squared of 0.6524, while the Random Forest model has an Out-of-Bag R-squared of approximately 0.90, indicating better performance for capturing variance in the Random Forest model.
  • Residual analysis and further model comparison are recommended to understand local patterns and potential spatial non-stationarity. ### Model Comparison Code and Visualizations
# Plotting Variable Importance for Random Forest
importance <- rf_model$variable.importance
importance_df <- data.frame(Variable = names(importance), Importance = importance)

# Bar plot of variable importance
library(ggplot2)
ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  labs(title = "Variable Importance from Random Forest Model", x = "Variables", y = "Importance")

# Residual Mapping for GWR
# Extract residuals from the GWR model
mdata_sp$residuals <- gwr_model$lm$residuals

# Convert to sf object for visualization
mdata <- st_as_sf(mdata_sp)

# Map residuals to identify spatial patterns
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(mdata) +
  tm_dots("residuals", style = "quantile", title = "GWR Residuals")
Variable(s) "residuals" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Conclusion

Geographically Weighted Regression and Geographic Random Forest reveal spatial variability that cannot be captured through global models. These approaches provide critical insights into spatial dependencies and improve the accuracy of predictive modeling in geographically distributed data.